20 research outputs found

    On the enhancement of Big Data Pipelines through Data Preparation, Data Quality, and the distribution of Optimisation Problems

    Nowadays, data are fundamental for companies, providing operational support by facilitating daily transactions. Data have also become the cornerstone of strategic decision-making processes in businesses. For this purpose, there are numerous techniques for extracting knowledge and value from data. For example, optimisation algorithms excel at supporting decision-making processes to improve the use of resources, time, and costs in the organisation. In the current industrial context, organisations usually rely on business processes to orchestrate their daily activities while collecting large amounts of information from heterogeneous sources. Therefore, the support of Big Data technologies (which are based on distributed environments) is required, given the volume, variety, and velocity of the data. To extract value from the data, a set of techniques or activities is then applied in an orderly way and at different stages. This set of techniques or activities, which facilitate the acquisition, preparation, and analysis of data, is known in the literature as Big Data pipelines.
    In this thesis, the improvement of three stages of Big Data pipelines is tackled: Data Preparation, Data Quality assessment, and Data Analysis. These improvements can be addressed from an individual perspective, by focussing on each stage, or from a more complex and global perspective, implying the coordination of these stages to create data workflows. The first stage to improve is Data Preparation, by supporting the preparation of data with complex structures (i.e., data with various levels of nested structures, such as arrays). Shortcomings have been found in the literature and in current technologies for transforming complex data in a simple way. Therefore, this thesis aims to improve the Data Preparation stage through Domain-Specific Languages (DSLs). Specifically, two DSLs are proposed for different use cases: one is a general-purpose data transformation language, while the other is aimed at extracting event logs in a standard format for process mining algorithms.
    The second area for improvement is the assessment of Data Quality. Depending on the type of Data Analysis algorithm, poor-quality data can seriously skew the results. A clear example is optimisation algorithms: if the data are not sufficiently accurate and complete, the search space can be severely affected. Therefore, this thesis formulates a methodology for modelling Data Quality rules adjusted to the context of use, as well as a tool that facilitates the automation of their assessment. This makes it possible to discard the data that do not meet the quality criteria defined by the organisation. In addition, the proposal includes a framework that helps to select actions to improve the usability of the data.
    The third and last proposal involves the Data Analysis stage. In this case, the thesis faces the challenge of supporting the use of optimisation problems in Big Data pipelines. There is a lack of methodological solutions for computing exhaustive optimisation problems in distributed environments (i.e., optimisation problems that guarantee finding an optimal solution by exploring the whole search space). The resolution of this type of problem in the Big Data context is computationally complex and can be NP-complete, for two reasons. On the one hand, the search space can increase significantly as the amount of data to be processed by the optimisation algorithms increases. This challenge is addressed through a technique to generate and group problems with distributed data. On the other hand, processing optimisation problems with complex models and large search spaces in distributed environments is not trivial. Therefore, a proposal is presented for a particular case of this type of scenario.
    As a result, this thesis develops methodologies that have been published in scientific journals and conferences. The methodologies have been implemented in software tools that are integrated with the Apache Spark data processing engine. The solutions have been validated through tests and use cases with real datasets.
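    The last contribution lends itself to a small illustration. The following is a minimal PySpark sketch of the general idea of grouping distributed data and solving one small exhaustive optimisation subproblem per group in parallel; it is not the thesis's actual technique, and the grouping key, item schema, and knapsack-style objective are assumptions made for the example.

```python
# Minimal sketch (not the thesis's method): distribute small exhaustive
# optimisation subproblems over grouped data with PySpark.
from itertools import combinations
from pyspark.sql import SparkSession

def best_subset(items, capacity=10.0):
    """Exhaustively enumerate every subset of a small group of (name, weight, value) items."""
    best, best_value = (), 0.0
    for r in range(1, len(items) + 1):
        for subset in combinations(items, r):
            weight = sum(w for _, w, _ in subset)
            value = sum(v for _, _, v in subset)
            if weight <= capacity and value > best_value:
                best, best_value = subset, value
    return best, best_value

if __name__ == "__main__":
    spark = SparkSession.builder.appName("grouped-exhaustive-optimisation").getOrCreate()
    rows = [("g1", "a", 4.0, 10.0), ("g1", "b", 7.0, 14.0),
            ("g2", "c", 3.0, 6.0), ("g2", "d", 5.0, 9.0), ("g2", "e", 6.0, 12.0)]
    rdd = spark.sparkContext.parallelize(rows)
    # One exhaustive subproblem per group key, solved in parallel across executors.
    solutions = (rdd.map(lambda r: (r[0], [(r[1], r[2], r[3])]))
                    .reduceByKey(lambda a, b: a + b)
                    .mapValues(best_subset))
    print(solutions.collect())
    spark.stop()
```

    Because each group is solved independently, the exhaustive search stays tractable per group while Spark distributes the groups across the cluster.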

    CHAMALEON: Framework to improve Data Wrangling with Complex Data

    Data transformation and schema conciliation are relevant topics in Industry due to the incorporation of data-intensive business processes in organizations. As the number of data sources increases, the complexity of such data increases as well, leading to complex and nested data schemata. Nowadays, novel approaches are being employed in academia and Industry to assist non-expert users in transforming, integrating, and improving the quality of datasets (i.e., data wrangling). However, there is a lack of support for transforming semi-structured complex data. This article presents a state-of-the-art review, identifying and analyzing the most relevant solutions that can be found in academia and Industry to transform this type of data. In addition, we propose a Domain-Specific Language (DSL) to support the transformation of complex data as a first approach to enhancing data wrangling processes. We also develop a framework to implement the DSL and evaluate it in a real-world case study.
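    As a hint of the kind of transformation such a DSL targets (this is not the CHAMALEON language itself, and the order/line-item schema is an assumption for the example), the following PySpark sketch flattens a nested array so that downstream tools can consume plain tabular rows:

```python
# Illustrative sketch of flattening nested (complex) data with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = SparkSession.builder.appName("nested-flattening").getOrCreate()
orders = spark.createDataFrame(
    [("o1", [("A", 2), ("B", 1)]), ("o2", [("C", 5)])],
    "order_id string, lines array<struct<sku:string, qty:int>>")

# Unnest the array so each output row holds a single (order, line item) pair.
flat = (orders
        .select("order_id", explode("lines").alias("line"))
        .select("order_id", col("line.sku").alias("sku"), col("line.qty").alias("qty")))
flat.show()
spark.stop()
```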

    DMN for Data Quality Measurement and Assessment

    Data Quality assessment is aimed at evaluating the suitability of a dataset for an intended task. The extensive literature on data quality describes various methodologies for assessing data quality by means of data profiling techniques applied to whole datasets. Our investigations aim to provide solutions to the need for automatically assessing the level of quality of the records of a dataset, where data profiling tools do not provide an adequate level of information. Since it is often easier to describe when a record has sufficient quality than to calculate a qualitative indicator, we propose a semi-automatic, business-rule-guided data quality assessment methodology for every record. This involves first listing the business rules that describe the data (data requirements), then those describing how to produce measures (business rules for data quality measurement), and finally those defining how to assess the level of data quality of a dataset (business rules for data quality assessment). The main contribution of this paper is the adoption of the OMG standard DMN (Decision Model and Notation) to support the description of data quality requirements and their automatic assessment using existing DMN engines. Ministerio de Ciencia y Tecnología RTI2018-094283-B-C33; Ministerio de Ciencia y Tecnología RTI2018-094283-B-C31; European Regional Development Fund SBPLY/17/180501/00029.
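    The following plain-Python sketch mimics the spirit of a record-level, business-rule-guided assessment (it is not the paper's DMN models or a DMN engine; the rules and field names are assumptions for the example): measurement rules are evaluated per record, and an assessment rule maps the measures to a quality level, much as a DMN decision table would.

```python
# Record-level data quality measurement and assessment, decision-table style.
from datetime import date

def measure(record):
    """Business rules for data quality measurement (illustrative)."""
    return {
        "age_complete": record.get("birth_date") is not None,
        "email_valid": "@" in (record.get("email") or ""),
        "date_not_future": (record.get("birth_date") or date.min) <= date.today(),
    }

def assess(measures):
    """Business rules for data quality assessment: map measures to a quality level."""
    if all(measures.values()):
        return "usable"
    if measures["age_complete"] and measures["date_not_future"]:
        return "usable_with_warnings"
    return "discard"

record = {"birth_date": date(1990, 5, 1), "email": "ana@example.org"}
print(assess(measure(record)))  # -> "usable"
```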

    Enabling Process Mining in Aircraft Manufactures: Extracting Event Logs and Discovering Processes from Complex Data

    Process mining is employed by organizations to fully understand and improve their processes and to detect possible deviations from expected behavior. Process discovery uses event logs as input data, which describe the times of the actions that occur in the traces. Currently, Internet-of-Things environments generate massive, distributed, and not always structured data, which brings about new complex scenarios, since data must first be transformed in order to be handled by process mining tools. This paper presents a successful application of a solution that permits the transformation of complex semi-structured data of an aircraft assembly process in order to create event logs that can be managed by the process mining paradigm. A Domain-Specific Language and a prototype have been implemented to facilitate the extraction of data into the unified traces of an event log. The implementation has been applied within a project in the aeronautic industry, and promising results have been obtained from the log extraction for the discovery of processes and the resulting improvement of the aircraft assembly process. Ministerio de Ciencia y Tecnología RTI2018-094283-B-C3
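    As an illustration of the extraction step (this is not the paper's DSL; the field names and activities are assumptions for the example), the following Python sketch groups semi-structured records into per-case traces ordered by timestamp, which is the basic shape process-discovery algorithms expect from an event log:

```python
# Turning semi-structured records into event-log traces grouped by case identifier.
from collections import defaultdict
from datetime import datetime

raw_events = [
    {"aircraft_id": "AC-1", "task": "install_wing", "ts": "2023-01-10T08:00:00"},
    {"aircraft_id": "AC-1", "task": "inspect_wing", "ts": "2023-01-10T12:30:00"},
    {"aircraft_id": "AC-2", "task": "install_wing", "ts": "2023-01-11T09:15:00"},
]

# Group events into traces (one trace per case) and sort each trace by timestamp.
traces = defaultdict(list)
for e in raw_events:
    traces[e["aircraft_id"]].append(
        {"activity": e["task"], "timestamp": datetime.fromisoformat(e["ts"])})
for case_id in traces:
    traces[case_id].sort(key=lambda ev: ev["timestamp"])

print(traces["AC-1"][0]["activity"])  # -> "install_wing"
```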

    Treatment with tocilizumab or corticosteroids for COVID-19 patients with hyperinflammatory state: a multicentre cohort study (SAM-COVID-19)

    Objectives: The objective of this study was to estimate the association between tocilizumab or corticosteroids and the risk of intubation or death in patients with coronavirus disease 19 (COVID-19) with a hyperinflammatory state according to clinical and laboratory parameters. Methods: A cohort study was performed in 60 Spanish hospitals including 778 patients with COVID-19 and clinical and laboratory data indicative of a hyperinflammatory state. Treatment was mainly with tocilizumab, an intermediate-high dose of corticosteroids (IHDC), a pulse dose of corticosteroids (PDC), combination therapy, or no treatment. Primary outcome was intubation or death; follow-up was 21 days. Propensity score-adjusted estimations using Cox regression (logistic regression if needed) were calculated. Propensity scores were used as confounders, matching variables and for the inverse probability of treatment weights (IPTWs). Results: In all, 88, 117, 78 and 151 patients treated with tocilizumab, IHDC, PDC, and combination therapy, respectively, were compared with 344 untreated patients. The primary endpoint occurred in 10 (11.4%), 27 (23.1%), 12 (15.4%), 40 (25.6%) and 69 (21.1%), respectively. The IPTW-based hazard ratios (odds ratio for combination therapy) for the primary endpoint were 0.32 (95%CI 0.22-0.47; p < 0.001) for tocilizumab, 0.82 (0.71-1.30; p 0.82) for IHDC, 0.61 (0.43-0.86; p 0.006) for PDC, and 1.17 (0.86-1.58; p 0.30) for combination therapy. Other applications of the propensity score provided similar results, but were not significant for PDC. Tocilizumab was also associated with lower hazard of death alone in IPTW analysis (0.07; 0.02-0.17; p < 0.001). Conclusions: Tocilizumab might be useful in COVID-19 patients with a hyperinflammatory state and should be prioritized for randomized trials in this situation.

    Outcomes from elective colorectal cancer surgery during the SARS-CoV-2 pandemic

    This study aimed to describe the change in surgical practice and the impact of SARS-CoV-2 on mortality after surgical resection of colorectal cancer during the initial phases of the SARS-CoV-2 pandemic.

    Analysis of Big Data Architectures and Pipelines: Challenges and Opportunities

    Continuous technological advances are promoting changes in multiple aspects of society. One of the consequences of these advances is the increase in the amount of data generated every day. In this scenario, Big Data has emerged as one of the most disruptive paradigms of recent years, becoming a matter of great interest for multiple types of organizations. This interest is due to the fact that Big Data enables organizations to extract value from the data they own. At the same time, Big Data is promoting further technological changes that are increasing the potential value that can be extracted from data. This value enables companies to increase and optimize their productive capacity, contributing to their competitive advantages and easing decision-making. As a result, Big Data has become one of the most studied fields, both in the literature and in Industry. It is a field in continuous evolution, presenting substantial challenges and opportunities that could increase the quality of the process of extracting value from data. However, since the Big Data paradigm is continually evolving, a detailed and concise study of all aspects related to it is required. This work presents a study of the state of the art of the Big Data paradigm: the concepts related to it, the activities and techniques involved in the value extraction process, and the corresponding data processing architectures. The study is structured in three parts. First, the concepts and activities related to the Big Data paradigm are contextualized, proposing a global vision of it. Second, the main limitations, challenges, opportunities, and possible research lines related to the Big Data paradigm are identified. Finally, a solution to one of the research challenges raised in this study is proposed: a framework to deal with the preparation of data with complex structures. Universidad de Sevilla. Máster en Ingeniería Informática

    CC4Spark: Distributing event logs and big complex conformance checking problems

    Conformance checking is one of the disciplines that best exposes the power of process mining, since it allows detecting anomalies and deviations in business processes, helping to assess and improve their quality. This is an indispensable task, especially in Big Data environments, where large amounts of data are generated and where the complexity of the processes is increasing. CC4Spark enables companies to face this challenging scenario in two ways. First, it supports distributing conformance checking alignment problems by means of a Big Data infrastructure based on Apache Spark, allowing users to import, transform, and prepare event logs stored in distributed data sources and to solve the problems in a distributed environment. Second, the tool supports decomposed Petri nets, which helps to noticeably reduce the complexity of the models. Both characteristics help companies face increasingly frequent scenarios with large amounts of logs and highly complex business processes. CC4Spark is not tied to any particular conformance checking algorithm, so users can employ customised algorithms. Ministry of Science and Technology of Spain: ECLIPSE (RTI2018-094283-B-C33) project; European Regional Development Fund (ERDF/FEDER); MINECO (TIN2017-86727-C2-1-R); University of Seville: VI Plan Propio de Investigación y Transferencia (VI PPIT-US).
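    The following PySpark sketch is only a toy illustration of the distribution idea, not CC4Spark itself: each trace is checked independently against a deliberately simplified "model" (a set of allowed activity sequences), whereas real conformance checking aligns traces against a Petri net.

```python
# Toy per-trace conformance check distributed with Spark (illustrative only).
from pyspark.sql import SparkSession

MODEL = {("register", "approve", "ship"), ("register", "reject")}

def fits_model(trace):
    case_id, activities = trace
    return case_id, tuple(activities) in MODEL

spark = SparkSession.builder.appName("toy-conformance").getOrCreate()
log = spark.sparkContext.parallelize([
    ("c1", ["register", "approve", "ship"]),
    ("c2", ["register", "ship"]),          # deviates from the model
])
# Each trace is checked independently, so the work scales out across executors.
print(log.map(fits_model).collect())       # [('c1', True), ('c2', False)]
spark.stop()
```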

    When Data Quality Meets DMN

    To succeed in their business processes, organizations need data that not only attain suitable levels of quality for the task at hand, but that can also be considered usable for the business. However, many researchers ground the potential usability of the data on its quality. Organizations would benefit from receiving recommendations on the usability of the data before its use. We propose that the recommendation on the usability of the data be supported by a decision process, which includes a context-dependent data quality assessment based on business rules. Ideally, this recommendation would be generated automatically. Decision Model and Notation (DMN) enables the assessment of data quality based on the evaluation of business rules, and also provides stakeholders (e.g., data stewards) with sound support for automating the whole process of generating a recommendation regarding usability based on data quality. The main contribution of the proposal involves designing and enabling both DMN-driven mechanisms and a guiding methodology (DMN4DQ) to support the automatic generation of a decision-based recommendation on the potential usability of a data record in terms of its level of data quality. Furthermore, the proposal is validated through its application to a real dataset.
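    A hedged sketch of the underlying idea (not the DMN4DQ implementation; the quality dimensions and thresholds are assumptions for the example): as in a DMN decision table with a first-hit policy, the usability recommendation logic can live in a table of rules rather than in code, so data stewards can adjust it without touching the program.

```python
# Usability recommendation driven by a rule table (first matching rule wins).
RULES = [  # (min completeness, min accuracy, recommendation)
    (0.95, 0.95, "use"),
    (0.80, 0.80, "use with caution"),
    (0.0,  0.0,  "do not use"),
]

def recommend(completeness, accuracy):
    for min_c, min_a, outcome in RULES:
        if completeness >= min_c and accuracy >= min_a:
            return outcome

print(recommend(0.97, 0.99))  # -> "use"
print(recommend(0.85, 0.70))  # -> "do not use" (accuracy below every higher rule)
```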